You are viewing the RapidMiner Studio documentation for version 10.1.

Cut Document (Text Processing)

Synopsis

Cuts an input document into segments using regular expressions that specify the start and end of each segment.

Description

This operator segments a text based on a starting and ending regular expression.
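
The segmentation idea can be illustrated outside RapidMiner with a minimal Python sketch. This is not the operator's implementation: the function name `cut_document` is hypothetical, and the sketch assumes that segments do not overlap and that the start and end matches themselves are excluded from each segment.

```python
import re


def cut_document(text, start_pattern, end_pattern):
    """Return the text between each start match and the next end match.

    Hypothetical helper for illustration only; it assumes non-overlapping
    segments and drops the delimiters themselves.
    """
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    segments = []
    pos = 0
    while pos <= len(text):
        start = start_re.search(text, pos)
        if start is None:
            break
        # Look for the end delimiter only after the start delimiter.
        end = end_re.search(text, start.end())
        if end is None:
            break
        segments.append(text[start.end():end.start()])
        # Always advance, so zero-length matches cannot loop forever.
        pos = max(end.end(), pos + 1)
    return segments


text = "<item>alpha</item><item>beta</item>"
segments = cut_document(text, r"<item>", r"</item>")
print(segments)  # ['alpha', 'beta']
```

Each returned string corresponds to one entry of the output collection in this simplified model.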

Input

  • document

Output

  • documents (Collection)

    Collection of the segments cut from the input document.

Parameters

  • query type Specifies the type of the query. The available query types are: String Matching, Regular Expression, Regular Region, Indexed, XPath and JSONPath; Range: selection
  • string matching queries Specifies a list of string matching start and end sequences. Everything between a start and an end sequence is used as the result. See the operator documentation for details on string matching. Range: list
  • attribute type Specifies the type of the resulting attributes. If numerical or binominal is chosen, ensure that the returned result can be interpreted as that type. The available types are: Nominal, Numerical and Binominal; Range: selection
  • regular expression queries Specifies a list of attribute names and their corresponding regular expressions. The first matching group is used as value. See the operator documentation for details on regular expressions. Range: list
  • regular region queries Specifies a list of attribute names and their corresponding regular expressions. Two regular expressions can be specified to define the start and the end of a region. Everything between the two matches is delivered as the result. Range: list
  • xpath queries Specifies a list of attribute names and their corresponding XPath queries. See the operator documentation for details on XPath. Range: list
  • namespaces Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h. Range: list
  • ignore CDATA Indicates whether CDATA should be ignored when evaluating XPath expressions. Range: boolean
  • assume html If checked, a more tolerant XML parser is used, which copes with forbidden HTML constructions but always assumes HTML and adds missing tags. Uncheck this for plain XML. Range: boolean
  • index queries Specifies a list of attribute names and the regions. Regions are specified as offset index and length of the match. Range: list
  • jsonpath queries Specifies a list of attribute names and their corresponding JSONPath queries. Range: list
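
To make three of the query types above concrete, here is a small Python sketch of how a regular expression query, an indexed query, and a string matching query could behave. All function names are hypothetical and the code is not RapidMiner's. It assumes zero-based offsets and that only the first occurrence is returned, neither of which is stated in this documentation.

```python
import re


def regex_query(text, pattern):
    """Return the first capturing group of the first match, or None."""
    compiled = re.compile(pattern)
    match = compiled.search(text)
    if match is None:
        return None
    # Fall back to the whole match if the pattern has no group (assumption).
    return match.group(1) if compiled.groups >= 1 else match.group(0)


def index_query(text, offset, length):
    """Return the region given by an offset and a length (zero-based offset assumed)."""
    return text[offset:offset + length]


def string_matching_query(text, start, end):
    """Return everything between the first start sequence and the next end sequence."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


text = "Order 4711 shipped to Berlin."
print(regex_query(text, r"Order (\d+)"))                 # 4711
print(index_query(text, 0, 5))                           # Order
print(string_matching_query(text, "shipped to ", "."))   # Berlin
```

The extracted values are strings; converting them to a numerical or binominal attribute type would be a separate step that only succeeds when the text is interpretable as that type.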